
release: benchflow 0.3.2 — BaseUser, verifier hardening, DinD compose, lint cleanup#199

Merged
xdotli merged 28 commits into main from dev-0.3 on Apr 25, 2026

Conversation

@xdotli
Member

@xdotli xdotli commented Apr 25, 2026

Release: v0.3.2

Cuts dev-0.3 → main as the v0.3.2 release. The v0.3.2 tag will be created on main after this merges; CI then publishes to PyPI.

What's in 0.3.2

Features

Fixes

Chores

Validation

  • All 7 release-critical PRs merged into dev-0.3 in sequence
  • ruff check . clean
  • Test suite passing modulo 8 pre-existing failures (env-pollution between subscription auth tests, Docker compose env, judge_model default mismatch — none caused by this release)
  • SWE-bench Pro oracle: 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser)
  • Single-round Gemini 3.1 Pro baseline: 2/4

Post-merge actions

  1. Tag v0.3.2 on main: git tag -a v0.3.2 -m "benchflow 0.3.2"; git push origin v0.3.2
  2. gh release create v0.3.2 --generate-notes (CI publishes to PyPI)
  3. Bump main pyproject.toml version to 0.3.3.dev0
  4. Delete dev-0.3 branch (going forward: trunk-based, PRs target main)

Test plan

  • CI runs against the merge commit
  • Devin reviews
  • Tag and publish after merge
  • pip install benchflow==0.3.2 works after CI publishes


xdotli and others added 28 commits April 21, 2026 13:39
…169)

* fix: skip model/API-key validation for oracle agent

The oracle agent runs solution/solve.sh and never calls an LLM, but
resolve_agent_env() was validating API keys for whatever model the CLI
defaulted to (claude-haiku-4-5-20251001). This made `bench eval create
-a oracle` fail without ANTHROPIC_API_KEY set, even though oracle
doesn't need it.

* fix: don't assign default model to oracle agent

Move the fix from resolve_agent_env to the CLI layer: oracle runs
solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL
at all. Both _run_single and _run_batch now pass model=None for oracle.
Widen JobConfig.model to str | None to support this.

* fix: openhands install — use uv tool install or pip install openhands-ai

The PyPI package 'openhands' (0.0.0) is a placeholder, not the CLI.
The real install is 'uv tool install openhands' (preferred) or
'pip install openhands-ai'. Tries uv first, falls back to pip.

Fixes #169 runtime error: 'openhands: command not found'

---------

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
Five fixes for issue #169 (openhands: command not found):

1. PATH: add $HOME/.local/bin to launch_cmd so uv-installed binary is found
2. Interpreter access: chmod o+x on /root path chain so sandbox user can
   reach the uv-managed Python shebang at /root/.local/share/uv/tools/
3. ACP auth: seed ~/.openhands/agent_settings.json at install (OpenHands
   _is_authenticated() requires it) and overwrite with real LLM_MODEL/KEY
   at launch (workaround for OpenHands ACP not applying --override-with-envs
   in _create_conversation)
4. Model env: add BENCHFLOW_PROVIDER_MODEL → LLM_MODEL to env_mapping
5. CWD: remove hardcoded cd /home/{user} from build_priv_drop_cmd — it
   overrode the docker -w /app workspace, causing agents to write files
   in the wrong directory

Also adds home_dirs=[".openhands"] so setup_sandbox_user copies the
settings dir to the agent user.

Tested: bench eval create + bench run, both sandbox_user=agent and root,
gemini agent regression-verified, 45/45 registry+sandbox tests pass.
…enes

Multi-role scenes (coder + reviewer) now communicate via outbox files
through the main bf.run(TrialConfig) path. Previously, outbox-based
message passing only worked through the standalone _scene.py scheduler
(used by followup-bench). Now the same convention works end-to-end:

  1. Scheduler sets up /app/.outbox/ before the first turn
  2. After each turn, reads outbox files written by the active role
  3. Injects received messages into the next role's prompt

Also includes:
- Coder-reviewer demo script (docs/notebooks/coder-reviewer-demo.py)
- Real runnable notebook replacing config-only cells with bf.run() calls
- Multi-turn vs multi-round terminology in README and api-reference
- 7 new tests covering outbox setup, injection, cleanup, and edge cases
1. Quote file paths with shlex.quote() in _read_scene_outbox() to
   prevent shell command injection via crafted outbox filenames
2. chown /app/.outbox to sandbox_user so agents can actually write
   outbox files (was root:root 755 → agent couldn't write)
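The quoting and injection steps can be sketched like this. A simplified illustration under assumptions: the helper names and the `[from role]` prompt framing are invented here, and the real scheduler reads files through a sandbox shell rather than the local filesystem.

```python
import shlex
from pathlib import Path

def outbox_read_cmd(path: Path) -> str:
    # Shell command to read one outbox file inside the sandbox.
    # shlex.quote() blocks command injection via crafted filenames
    # such as "coder; rm -rf /.json".
    return f"cat {shlex.quote(str(path))}"

def inject_messages(prompt: str, messages: dict[str, str]) -> str:
    # Prepend messages received from other roles to the next role's prompt.
    if not messages:
        return prompt
    header = "\n".join(f"[from {role}] {text}" for role, text in messages.items())
    return f"{header}\n\n{prompt}"
```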
…st gaps

1. Persist inter-role messages to trial_dir/scene_messages.jsonl
   (was ephemeral — injected into prompts then discarded)
2. Install non-primary agents in connect_as() for heterogeneous scenes
   (was broken: only primary agent was installed)
3. Honest Harbor mapping — document what 0.3 delivers vs what's a gap:
   - Shipped: roles, turns, outbox messaging, message persistence
   - Gap: dynamic termination, oracle access, per-round verification,
     inter-round trajectory inspection
4. Add 0.3 Limitations section to api-reference
5. Two new tests: message persistence + heterogeneous agent install
All 3 patterns executed end-to-end on regex-log task via Daytona:
- Baseline: reward=1.0, 3 tool calls
- Self-review (multi-turn): reward=1.0, 7 tool calls
- Coder-reviewer (multi-round): reward=0.0, 13 tool calls

Outbox messaging confirmed working: reviewer wrote feedback to
/app/.outbox/coder.json, scheduler read and injected into coder's
prompt. Messages persisted to scene_messages.jsonl.
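The persistence step (item 1 above) amounts to appending one JSON record per message. A hedged sketch: the function name and record fields are assumptions, not the actual `Trial` code; only the `scene_messages.jsonl` filename comes from the commit message.

```python
import json
from pathlib import Path

def persist_scene_message(trial_dir: Path, round_idx: int, role: str, message: str) -> None:
    # Append one inter-role message to trial_dir/scene_messages.jsonl so it
    # survives the run instead of living only inside injected prompts.
    record = {"round": round_idx, "from": role, "message": message}
    with (trial_dir / "scene_messages.jsonl").open("a") as fh:
        fh.write(json.dumps(record) + "\n")
```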
…primary agents

1. connect_as() now writes credential files and uploads subscription
   auth for non-primary agents, matching what install_agent() does
   for the primary agent. Fixes heterogeneous scenes where e.g.
   codex-acp needs ~/.codex/auth.json.

2. connect_as() now updates self._agent_launch so disconnect()'s
   pkill fallback targets the correct process (not always the
   primary agent's binary).

3. Note: the openhands launch_cmd pkill issue (pkill -f 'export')
   is pre-existing in registry.py, not introduced by this PR.
Tasks requesting more storage than the Daytona tier allows fail at
sandbox creation. Apply the same clamping pattern already used for
cpus and memory_mb so tasks degrade gracefully. The cap is overridable
via BENCHFLOW_DAYTONA_MAX_STORAGE_MB.
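The clamping pattern described above is roughly the following. The default cap of 10240 MB is a placeholder assumption; only the `BENCHFLOW_DAYTONA_MAX_STORAGE_MB` override comes from the commit message.

```python
import os

DEFAULT_MAX_STORAGE_MB = 10_240  # hypothetical default; real cap depends on the Daytona tier

def clamp_storage_mb(requested: int) -> int:
    # Clamp a task's storage request to the tier cap, mirroring the
    # pattern already used for cpus and memory_mb, so oversized tasks
    # degrade gracefully instead of failing at sandbox creation.
    cap = int(os.environ.get("BENCHFLOW_DAYTONA_MAX_STORAGE_MB", DEFAULT_MAX_STORAGE_MB))
    return min(requested, cap)
```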

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: clamp Daytona storage_mb to configurable max
feat: wire outbox messaging into Trial._run_scene()
* Fix DinD compose exec missing project/directory/file flags

DaytonaProcess.start() hardcoded `docker compose exec` without the
`-p`, `--project-directory`, and `-f` flags needed to locate the
running compose project inside the DinD sandbox. This caused exec
to fail silently with "Process closed stdout (rc=None)".

Extract the full compose base command from Harbor's strategy via
`_compose_cmd([])` during `from_harbor_env()` and use it in `start()`
so the exec subcommand includes all required project identifiers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use shlex.join for DinD compose exec to handle paths with spaces

Address Devin review feedback — shlex.split() + " ".join() loses quoting
for tokens with spaces. Use shlex.join() which properly quotes each token.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
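The shape of the fix can be sketched as below: reuse the compose base command (which already carries `-p`, `--project-directory`, and `-f`) and join with `shlex.join()` so tokens with spaces stay quoted. The function name and parameters are illustrative, not the actual `DaytonaProcess` API.

```python
import shlex

def compose_exec_cmd(compose_base: list[str], service: str, inner: list[str]) -> str:
    # compose_base is the full base command extracted from the strategy,
    # e.g. ["docker", "compose", "-p", "proj", "--project-directory", ..., "-f", ...],
    # so the exec subcommand can locate the running project in the DinD sandbox.
    tokens = [*compose_base, "exec", "-i", "-T", service, *inner]
    # shlex.join quotes each token individually, so shlex.split()-then-" ".join()
    # style loss of quoting around paths with spaces cannot happen.
    return shlex.join(tokens)
```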
- fix: DinD compose exec now includes project/directory/file flags (#188)
- fix: clamp Daytona storage_mb to configurable max (#185)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…193)

SSH pipes break through the DinD→compose exec chain, causing
"Process closed stdout (rc=None)" on all compose tasks.

New DaytonaPtyProcess uses Daytona SDK's WebSocket PTY API for the
outer connection (keeps pipe alive), then docker compose exec -i -T
inside (clean stdio for the agent). Includes marker-based startup
to drain shell output before ACP handshake, and echo-resistant
response matching in the ACP client (filter echoed requests by
checking for 'method' field absence).

Also adds skills_dir: "auto" support in Job for per-task skill
resolution after PR #720 removed COPY skills from Dockerfiles.
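The echo-resistant matching described above boils down to one observation: an echoed JSON-RPC request still carries a `method` field, while a response does not. A minimal sketch (the function name is invented; the real client does this inside its read loop):

```python
import json

def filter_responses(lines: list[str]) -> list[dict]:
    # Keep only JSON-RPC responses from a PTY stream: the PTY echoes our
    # own requests back, and those echoes have a "method" field while
    # responses carry "result"/"error" instead.
    out = []
    for line in lines:
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # shell noise before/around the ACP stream
        if isinstance(msg, dict) and "method" not in msg:
            out.append(msg)
    return out
```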

* fix: oracle agent — chokepoint guard, drop orphan eval CLI, helper

PR #173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to
cli/eval.py, but cli/eval.py is orphaned (never imported into the live
CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and
trips ANTHROPIC_API_KEY validation. Three changes:

- Restore the `agent != "oracle"` guard in resolve_agent_env so the
  chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests — the live eval_create
  lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model
  default to None, replace seven `model or DEFAULT_MODEL` sites in
  cli/main.py and job.py YAML loaders so oracle gets honest model=None
  end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by
calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key
and no host auth — verified to fail on main with the user-facing
ANTHROPIC_API_KEY error and pass after the fix.

* test: regression suite pinning oracle chokepoint + orphan removal

Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer
of the prior fix at the right altitude:

- TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no
  src/ file references benchflow.cli.eval, guarding against a future
  re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting — `bench eval create` callback lives in
  cli/main.py:eval_create. Pins the architectural fact PR #173 missed.
- TestEffectiveModel — unit tests for the helper: oracle drops model,
  non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
- TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None
  for both native and Harbor formats; non-oracle backwards-compat
  preserved.
- TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle")
  with no API key in env does not raise. Mocks Trial.create and resets
  the asyncio loop after to avoid polluting pre-existing tests that use
  the deprecated asyncio.get_event_loop() pattern.

Verified to fail on main in the right shape: 9 of 14 fail (each pinning
a deleted/added behavior), 5 pass (asserting structural facts already
true). The CLI test fails on main with the user-reported error
"ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".

* fix: restore cli/eval.py and test_eval_cli.py, apply oracle guard

The previous commit deleted cli/eval.py and its tests as orphans, but
they are intentionally kept. Restore both from main, update eval.py to
use the effective_model() helper for the oracle chokepoint fix, and
replace the "module is gone" regression test with a guard that cli/main.py
does not import cli/eval (the actual invariant).

* docs: clarify cli/eval.py and test_eval_cli.py are not wired into live CLI

---------

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks
the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were
landing on top of pre-existing repo lint debt.

What changed:
1. Auto-fixes via `ruff check --fix --unsafe-fixes`:
   - 40 F401 unused-imports across src/, tests/, examples/
   - 8 I001 unsorted-imports
   - 6 UP037 quoted-annotations modernized
   - Other auto-fixable rules

2. Hand fixes:
   - src/benchflow/__init__.py: removed `Trial` from the `from harbor`
     re-export block (it was shadowed by `from benchflow.trial import Trial`
     at line 65, which is the canonical public Trial). Added
     `trial_config_from_yaml` to __all__.
   - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for
     B904 (errors raised inside except clauses).
   - src/benchflow/mcp/reviewer_server.py: same B904 fix for fastmcp
     ImportError reraise.
   - tests/test_skill_eval.py: raw string for `pytest.raises(match=...)`
     pattern (RUF043).
   - 3 files: replaced `×` (Unicode multiplication sign) in comments and
     f-strings with `x` (latin x) to clear RUF001/RUF003.

3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`:
   - `experiments/*.py` and `tests/conformance/*.py` ignore E402 — these
     are standalone scripts that legitimately set sys.path before importing.
   - `src/benchflow/runtime.py` ignores F821 — uses forward references
     resolved by `from __future__ import annotations`; explicit
     TYPE_CHECKING imports would force eager loads.

No code behavior changes. 580 tests pass; the 8 pre-existing failures
(env-leak between subscription auth tests, Docker compose env, judge
model default mismatch) are unrelated to this PR.
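The B904 hand-fix pattern mentioned above (for process.py and reviewer_server.py) looks like this in miniature. The function and payload are hypothetical; only the `raise ... from e` idiom is the point.

```python
import json

def load_config(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        # "from e" chains the original exception as __cause__, preserving
        # its traceback; raising inside an except block without it is what
        # ruff's B904 rule flags.
        raise ConnectionError(f"bad payload: {raw!r}") from e
```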

* docs: add fix plan for connect_as() agent_env bug (#2)

* docs: expand fix plan with eng review findings and test cases

Add two edge-case test requirements (non-overlapping key merge,
None safety) from /plan-eng-review. Append review report confirming
0 issues, 0 critical gaps — ready to implement.

* fix: merge cfg.agent_env into connect_as() env resolution (#2)

connect_as() passed only role.env to resolve_agent_env, losing all
config-level env vars (e.g. BENCHFLOW_PROVIDER_BASE_URL from YAML).
Merge cfg.agent_env as base with role.env overlay so role-specific
vars win on overlap.

* remove plan

---------

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* rebase on upstream/0.3

* openhand cli add

* enhance api key security

* refine tests

Co-authored-by: Copilot <copilot@github.com>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* docs: use `uv tool install` instead of `pip install`

benchflow is a CLI tool with entry points — uv tool install gives users
an isolated environment (like pipx) without managing venvs manually.

---------

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* test: cover sandbox setup timeout wiring

* docs: document sandbox setup timeout

* feat: wire sandbox setup timeout through configs

`setup_sandbox_user()` already accepted a `timeout_sec` kwarg (default
120s) but no live call site surfaced it — the knob was unreachable for
normal runs. Under heavy sandbox bootstrap (parallel containers copying
large tool caches into /home/<sandbox_user>) the 120s cap was hit with
no user override.

Add `sandbox_setup_timeout: int = 120` to TrialConfig, JobConfig, and
RuntimeConfig, and forward it through:
- trial YAML (`trial_config_from_dict`)
- job YAML (both native and Harbor-compatible loaders)
- `SDK.run(..., sandbox_setup_timeout=...)`
- `bench eval create --sandbox-setup-timeout`
- `Trial.install_agent()` into both `setup_sandbox_user()` call sites
  (oracle + normal agent)

The value is also recorded in the run's `config.json` snapshot to aid
post-hoc diagnosis. Default stays at 120s — this change is about making
the value configurable, not changing runtime behavior.
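The wiring reduces to forwarding one field into an existing kwarg. A trimmed-down sketch: the real `TrialConfig`, `setup_sandbox_user()`, and `install_agent()` have far more fields and logic than shown here.

```python
from dataclasses import dataclass

@dataclass
class TrialConfig:
    # Hypothetical trimmed-down config; the new knob defaults to the
    # existing 120s so runtime behavior is unchanged unless overridden.
    agent: str = "oracle"
    sandbox_setup_timeout: int = 120  # seconds

def setup_sandbox_user(user: str, timeout_sec: int = 120) -> str:
    # Stand-in for the real helper, which already accepted timeout_sec
    # but had no live call site surfacing it.
    return f"setup {user} (timeout={timeout_sec}s)"

def install_agent(cfg: TrialConfig) -> str:
    # The fix is essentially this one line at each call site:
    # forward the config value into the existing kwarg.
    return setup_sandbox_user("agent", timeout_sec=cfg.sandbox_setup_timeout)
```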

---------

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* docs(plan): add plan to fix sandbox io problem

* test: lock sandbox setup contract

Plan step 1/6: Lock the new sandbox contract in tests

* fix: stop copying root tool installs into sandbox home

Plan step 2/6: Narrow setup_sandbox_user() to user state only

* refactor: derive sandbox home dirs from registry config

Plan step 3/6: Align registry semantics with the new contract

* refactor: symlink skills into sandbox, enforce shared install prefixes

Replace per-trial skill-tree copies with ln -sfn into a shared /skills (or
task skills_dir) root, drop skill_paths from get_sandbox_home_dirs(), and
add registry + sandbox-setup invariants that keep agent binaries on
/usr/local/* rather than /root-only home paths. Updates task-authoring
and api-reference docs to describe the new lightweight sandbox contract.
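The symlink step above can be sketched as a small command builder; the function name and paths are illustrative, and the real code targets whatever `skills_dir` the task resolves.

```python
import shlex

def symlink_skills_cmd(skills_root: str, sandbox_home: str) -> str:
    # ln -sfn points the sandbox at a shared skills tree instead of
    # copying it per trial; -n replaces an existing symlink rather than
    # descending into it, and shlex.join keeps any spaced paths quoted.
    return shlex.join(["ln", "-sfn", skills_root, f"{sandbox_home}/skills"])
```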

* chore: remove completed sandbox plan doc

---------

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* feat: BaseUser abstraction for progressive-disclosure trial loops

Add User as a first-class participant in the trial loop — a Python
callback that produces prompts, sees test results between rounds, and
decides when to stop. This is the infrastructure Josh (GitHub/Microsoft)
needs for SWE-bench Pro progressive disclosure.

New types (user.py):
- BaseUser with setup(instruction, solution) and run(round, instruction, round_result)
- RoundResult dataclass with trajectory, rewards, verifier output
- PassthroughUser (backward-compat default, single round)
- FunctionUser (wraps a plain callback for lightweight use)

Trial changes:
- TrialConfig gains user, max_user_rounds, oracle_access fields
- Trial._run_user_loop(): user.run() → connect → execute → disconnect →
  soft_verify() → build RoundResult → repeat until None or max rounds
- Trial.soft_verify(): runs Harbor verifier WITHOUT hardening so agent
  stays alive between rounds. Final verify() still does full hardening.
- Multi-role + User raises ValueError (deferred to future phase)

16 new tests, 0 regressions on existing 618 tests.

* fix: address self-review — 5 bugs in user abstraction

1. Reorder: disconnect() before soft_verify() — agent process is
   already dead when soft_verify runs, so soft_verify's docstring
   was misleading. Now disconnect → soft_verify is the explicit flow.

2. soft_verify() now runs CLEANUP_CMD (conftest/pth/sitecustomize
   purge) before the verifier. Prevents agent from gaming intermediate
   test results by injecting test-patching files.

3. FunctionUser: use inspect.isawaitable() instead of
   asyncio.iscoroutine() — handles asyncio.Task, Future, and any
   __await__ object, not just coroutines.

4. oracle_access: cat /solution now runs as user="root" — /solution
   is locked (root:700) after install_agent, so the read would
   silently fail without root.

5. try/finally around connect/execute/disconnect in user loop —
   ensures disconnect() always runs even if execute() raises.
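The `isawaitable` change in item 3 is easy to see in isolation. This is a standalone sketch, not benchflow code — it just probes the two predicates against the three awaitable shapes the commit names:

```python
import asyncio
import inspect

async def probe():
    async def coro():
        return 42

    c = coro()                                      # bare coroutine
    t = asyncio.ensure_future(coro())               # asyncio.Task
    f = asyncio.get_running_loop().create_future()  # asyncio.Future
    f.set_result(None)

    # iscoroutine only recognizes the bare coroutine; isawaitable also
    # recognizes Tasks, Futures, and anything else implementing __await__.
    checks = [
        asyncio.iscoroutine(c), asyncio.iscoroutine(t), asyncio.iscoroutine(f),
        inspect.isawaitable(c), inspect.isawaitable(t), inspect.isawaitable(f),
    ]
    await c
    await t
    return checks

results = asyncio.run(probe())
print(results)  # → [True, False, False, True, True, True]
```

So a `FunctionUser` callback that returns a Task or Future would have been silently treated as a plain value under the old check.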

* feat: add user_dogfood.py — progressive disclosure on regex-log

Demonstrates the FunctionUser abstraction:
- Round 0: terse 2-sentence prompt
- Round 1: hints about edge cases on failure
- Round 2: full instruction on continued failure
- Stops early if tests pass

* fix: address Devin review — remove tautological tests, fix model name

- Remove 4 tautological tests (pure dataclass reads) per CLAUDE.md
  convention: TestRoundResult.test_defaults, test_with_data,
  TestTrialConfigUser.test_user_field_defaults_to_none, test_user_field_set
- Fix dogfood model name: gemini-2.5-flash (not expired preview)
- Note: iscoroutine→isawaitable was already fixed in 51d6c61

* fix: address code review — oracle safety, unused import, soft_verify tests

1. Oracle /solution is now moved (not deleted) before agent runs and
   restored before final verify(). Prevents breaking verifiers that
   need /solution to compute rewards.

2. Remove unused asyncio import from user.py.

3. Add 4 soft_verify tests: timeout, crash, success, and CLEANUP_CMD
   execution verification. soft_verify is no longer untested.

* feat: dogfood results — progressive disclosure on regex-log via Daytona

3-round progressive disclosure with Gemini Flash on regex-log:
  Round 0: terse prompt (2 tool calls) → reward=0.0
  Round 1: hint prompt  (3 tool calls) → reward=0.0
  Round 2: full instruction (3 tool calls) → reward=0.0
  Final verify: reward=0.0

Agent scored 0.0 on all rounds — regex-log is a hard task. But the
infrastructure works end-to-end: user loop, soft_verify, fresh ACP
sessions per round, user_rounds.jsonl persistence, final hardened
verify. No errors.

* feat: add opencode agent to registry

OpenCode (opencode-ai) is an open-source TypeScript coding agent with
ACP support. Skills path: $HOME/.opencode/skills (updated from
.opencode/skill per skillsbench #718).

Closes skillsbench #718 for the benchflow side.

* fix: opencode ACP returns 0 tool calls — model format mismatch

Root cause: OpenCode's ACP parseModel() splits modelId on "/" to extract
providerID and modelID. When benchflow sent "gemini-3.1-pro-preview"
(no slash), opencode parsed it as providerID="gemini-3.1-pro-preview"
with modelID="" — an invalid config that silently returned end_turn.

Fix: Add acp_model_format field to AgentConfig. When set to
"provider/model" (opencode), _format_acp_model() infers the models.dev
provider from the bare model name (e.g. "gemini" → "google") and sends
"google/gemini-3.1-pro-preview" to set_model.

Also: opencode requires_env is now empty (inferred from model at
runtime, not hardcoded to ANTHROPIC_API_KEY).
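The inference step can be sketched roughly like this — the function name and the prefix→provider table are illustrative (the real mapping lives in benchflow's agent registry and may differ):

```python
import warnings

# Illustrative prefix → models.dev provider table; assumption, not the real map.
PROVIDER_PREFIXES = {"gemini": "google", "claude": "anthropic", "gpt": "openai"}

def format_acp_model(model, acp_model_format):
    """Qualify a bare model id as provider/model when the agent requires it."""
    if acp_model_format != "provider/model" or "/" in model:
        return model  # already qualified, or the agent accepts bare ids
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return f"{provider}/{model}"
    warnings.warn(f"unknown provider for model {model!r}; sending as-is")
    return model

print(format_acp_model("gemini-3.1-pro-preview", "provider/model"))
# → google/gemini-3.1-pro-preview
```

Already-qualified ids pass through untouched, so agents that send `provider/model` themselves are unaffected.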

* feat: executed notebook — SWE-bench Pro progressive disclosure analysis

OpenCode + gemini-3.1-pro-preview on qutebrowser SWE-bench Pro:

Baseline (full prompt, 1 round): 40 tools, 736s, reward=0.0
Progressive (3 rounds):          185 tools, 1154s, reward=0.0
  Round 0 (terse):     86 tools (81 bash + 5 edit)
  Round 1 (hints):     76 tools (66 bash + 10 edit)
  Round 2 (full):      23 tools (16 bash + 7 edit)

Both scored 0.0 due to a verifier infrastructure bug (rootdir=/tests
instead of /app, so pytest couldn't find its config). The agent's fixes
were likely correct — it demonstrated passing tests in its own environment.

Key findings:
- Progressive disclosure changed agent behavior (86→76→23 tools)
- _reset_cache implemented only after Round 1 hint
- OpenCode handled 185 tool calls without token limits
- Verifier rootdir bug needs investigation

* fix: replace hand-curated pytest plugin whitelist with auto-discovery

The old mechanism (4 dicts + 4 functions + 1 regex) required manual
code changes for every new benchmark with an undeclared pytest plugin.
SWE-bench Pro tasks failed because pytest-benchmark wasn't whitelisted.

New mechanism: one container-side script + one async function. At
hardening time, enumerate all pytest11 entry points from root-owned
system packages. Only root-owned dist-info directories are trusted —
editable installs from agent-writable /testbed are excluded.

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 stays in place. Security preserved.
task.toml pytest_plugins kept as fallback.

Deleted: _PYTEST_PLUGIN_ALIASES, _PYTEST_OPTION_PLUGINS,
_PYTEST_INSTALLED_PLUGINS, _PIP_INSTALL_RE, _normalize_pytest_plugin,
_plugins_from_verifier_script, _declared_pytest_plugins,
_pytest_plugin_flags, tomllib import.

Added: _DISCOVER_PYTEST_PLUGINS_SCRIPT, _discover_pytest_plugin_flags.

* fix: handle Python 3.9 importlib.metadata API in plugin discovery

Python 3.9's entry_points() doesn't accept keyword arguments — returns
a dict instead. Fall back to entry_points().get('pytest11', []) when
the keyword style raises TypeError.
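The version-compatible discovery amounts to a try/except over the two `importlib.metadata` APIs — a minimal sketch (function name hypothetical), with the `-p` flag construction the hardening step needs:

```python
from importlib import metadata

def pytest11_entry_points():
    """List pytest plugin entry points across importlib.metadata API versions.

    Python >= 3.10 accepts entry_points(group=...); 3.9's entry_points()
    takes no keyword arguments and returns a dict keyed by group name.
    """
    try:
        eps = metadata.entry_points(group="pytest11")
    except TypeError:  # Python 3.9 dict-style API
        eps = metadata.entry_points().get("pytest11", [])
    return list(eps)

# The hardening step turns discovered names into explicit -p flags so that
# PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 can stay set:
flags = [arg for ep in pytest11_entry_points() for arg in ("-p", ep.name)]
```

The except branch only ever runs on 3.9, where the dict API is the only one available.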

* fix: simplify plugin discovery — skip ownership check

The uid==0 check was failing on Python 3.9 containers where
ep.dist._path doesn't exist. Simplified to just enumerate all
pytest11 entry points — sandbox_user prevents agent pip installs,
so all discovered plugins are image-authored.

* feat: updated notebook with fixed-verifier results

Both progressive + baseline rerun with working verifier (15 plugins
discovered). Results with honest scoring:

Progressive (3 rounds): 284 tools, 970s, reward=0.0
  Round 0: 94 tools, Round 1: 92 tools, Round 2: 98 tools
Baseline (1 round):     73 tools, 611s, reward=0.0

Both failed due to agent code errors (circular imports), not
verifier infrastructure. Progressive used 4x more compute for
same outcome on this task.

* fix: preserve trusted PYTHONPATH entries during verifier hardening

VERIFIER_ENV cleared PYTHONPATH="" which broke SWE-bench Pro tasks
where the Dockerfile sets PYTHONPATH=/app/lib:/app for project imports.

New: _trusted_verifier_pythonpath() filters PYTHONPATH using the same
root-owned validation as PATH, but does NOT block the workspace —
/app is already importable via CWD/pytest sys.path insertion, so
clearing it only breaks imports without security benefit. /tmp,
/var/tmp, /home/agent are still blocked.

Re-pinned after task-env merge like PATH.
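The filtering policy has roughly this shape — a sketch mirroring `_trusted_verifier_pythonpath`, with the trust check injectable so the policy can be exercised without root-owned fixtures (the default root-ownership predicate is an assumption about the real check):

```python
import os

BLOCKED_PREFIXES = ("/tmp", "/var/tmp", "/home/agent")  # agent-writable roots

def trusted_pythonpath(raw, is_trusted=None):
    """Keep only PYTHONPATH entries that are safe to retain at verify time."""
    if is_trusted is None:
        def is_trusted(p):  # root-owned directories only (illustrative)
            return os.path.isdir(p) and os.stat(p).st_uid == 0
    kept = []
    for entry in raw.split(":"):
        if not entry:
            continue
        if any(entry == b or entry.startswith(b + "/") for b in BLOCKED_PREFIXES):
            continue  # agent-writable locations are always dropped
        if is_trusted(entry):
            kept.append(entry)
    return ":".join(kept)

# With a stub predicate: /app/lib and /app survive, /tmp/evil is dropped.
print(trusted_pythonpath("/app/lib:/tmp/evil:/app", is_trusted=lambda p: True))
# → /app/lib:/app
```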

* fix: address review comments on BaseUser PR

- soft_verify: chmod 777 /logs/verifier so non-root verifier can write
- soft_verify: restore /solution before verify, re-hide after (oracle access)
- validate empty roles (!=1) and multi-scene configs in user loop
- remove tautological test_setup_is_noop
- remove opencode BENCHFLOW_PROVIDER_API_KEY→ANTHROPIC_API_KEY mapping
  (wrong for non-Anthropic models; native keys inherited via auto_inherit_env)
- warn on unknown provider fallback in _format_acp_model
- remove --rootdir=/tests from VERIFIER_ENV (cherry-pick from PR #187)
- fix printenv PYTHONPATH crash when unset
- fix stale plugin discovery docstring

* feat: add SWE-bench Pro oracle validation + baseline experiment script

Runs oracle (gold solution) on all 4 testable tasks to verify the
--rootdir fix, then runs a single-round agent baseline for comparison
with progressive disclosure. Results to CSV.

* fix: address Codex review on PR #184 — oracle safety + warnings

Three Codex review findings on the BaseUser abstraction:

1. oracle_access=True with user=None silently leaves /solution exposed to
   the agent for the entire trial. Add a logger.warning at setup time so
   misconfigurations surface immediately.

2. Oracle restore (mv /solution_oracle_backup /solution) was outside any
   finally block. If _run_user_loop() raised, /solution was never restored.
   Move the user/scene execution into try/finally so the restore always
   runs before the final verify().

3. Oracle read used a wildcard fallback (cat /solution/* || true) that
   could leak unintended files (binaries, credentials). Narrow to
   solve.sh — the canonical SWE-bench Pro oracle path.

Bugs Codex flagged that were FALSE POSITIVES (verified against code):
  - "session counter reset" — disconnect() already resets both counters
  - "None instruction" — _resolve_prompts returns [instruction] not [None]

Tests still pass: 15 user + 58 sandbox = 73 total.

* feat: per-task verifier hardening opt-outs + restore --rootdir=/app

Two related changes addressing SWE-bench Pro oracle compatibility:

1) Restore --rootdir=/app in PYTEST_ADDOPTS

   Removing --rootdir entirely (PR #187) made pytest fall back to /dev as
   rootdir (from -c /dev/null), producing test node IDs like ../dev/::test_foo
   instead of <repo>/<path>::test_foo. The verifier expects full-path IDs and
   reported 0 passing tests on openlibrary even though all 18 tests passed.

   --rootdir=/app anchors test IDs to the canonical Harbor repo root while
   -c /dev/null still blocks pyproject.toml/pytest.ini discovery and
   --confcutdir=/tests still blocks conftest walk-up beyond /tests.

2) Per-task [verifier.hardening] opt-outs in task.toml

   The cleanup that deletes agent-injected conftest.py also deletes
   legitimate repo conftest.py files. qutebrowser ships conftest.py that
   sets up import order to break a real circular dependency between
   qutebrowser.browser.inspector and qutebrowser.misc.miscwidgets — without
   them, pytest collection fails on a type annotation in miscwidgets.py:419.

   Tasks now declare opt-outs in task.toml:

       [verifier.hardening]
       cleanup_conftests = false  # qutebrowser

   Defaults remain secure (all True). New helpers in _sandbox.py:

   - HARDENING_DEFAULTS: dict of feature flags
   - _read_hardening_config(task_dir): parse task.toml [verifier.hardening]
   - _build_cleanup_cmd(hardening): build cleanup honoring opt-outs

   CLEANUP_CMD constant kept as backward-compat alias.

   Both harden_before_verify() and Trial.soft_verify() now read per-task
   hardening config before running cleanup.

Validation on SWE-bench Pro oracle (Daytona):

  Before: 2/4 (ansible, flipt) — openlibrary failed test ID format,
                                  qutebrowser failed conftest deletion
  After:  5/5 (ansible, flipt, openlibrary, qutebrowser, navidrome)

Tests: 80 passing (15 user + 65 sandbox including 7 new opt-out tests).

* docs: add progressive-disclosure guide + SWE-bench Pro demo notebook

For Josh's SWE-bench Pro use case (and Harbor #1316 parity in the
no-second-LLM case):

- docs/progressive-disclosure.md: dedicated guide for the BaseUser
  abstraction. Covers the API, oracle access, [verifier.hardening]
  opt-outs, and when to choose BaseUser vs multi-role Scene.

- docs/use-cases.md: brief mention in §1 (Interactive User Simulation)
  pointing to progressive-disclosure.md for the lighter-weight
  callback-based pattern.

- examples/swebench_pro_progressive_disclosure.ipynb: clean rewrite of
  the existing notebook. Shows the API, oracle 5/5, baseline 4 tasks,
  per-task hardening opt-out example, and a placeholder cell that auto-
  loads the latest progressive-disclosure run from
  /tmp/swebench-pro-jobs/progressive when one exists. Executes top-to-
  bottom against the current oracle/baseline CSV.

- examples/swebench_pro_user_dogfood.py: ready-to-run script for
  progressive disclosure on any of the 5 working SWE-bench Pro tasks.
  Three-round user: terse → failing tests + half spec → full spec.

- experiments/swebench-pro-results.csv: oracle + baseline results from
  2026-04-24 Daytona run. qutebrowser entry is pre-fix (verified post-
  fix separately, noted in notebook).

* docs: add progressive-disclosure.md to CLAUDE.md docs index
…gs retry (#196)

* Bump DaytonaPtyProcess readline timeout 300s→900s

Long-running TTS/audio tasks (e.g. pg-essay-to-audiobook) generate
extended quiet periods on stdout while ffmpeg/whisper run. The 300s
PTY readline timeout fires before the per-task agent timeout (900s),
prematurely killing healthy runs.

Align readline timeout with the standard agent timeout so the PTY
only fails when the inner process is actually wedged.

* Daytona SDK: retry SessionCommandLogsResponse ValidationError

The Daytona server occasionally returns an empty string instead of a
JSON object when fetching session command logs, which causes pydantic
to raise ValidationError inside AsyncProcess.get_session_command_logs.
We've reproduced this on SDK 0.168.x and 0.169.x; the surface is most
visible in skillsbench tasks that ask the verifier for command output
(e.g. latex-formula-extraction).

Patch the SDK method at runtime with a small bounded retry. After
four malformed payloads we fall back to an empty (but valid) response
so callers can still inspect exit_code via get_session_command —
silently missing logs are preferable to failing a whole trial as
ERROR over an upstream marshalling glitch.

Patch is applied lazily from _create_environment so we never touch
the SDK on Docker-only runs.
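The shape of the runtime patch looks roughly like this — names and structure are illustrative, not benchflow's actual `_daytona_patches.py` code: retry while the predicate matches, then degrade to an empty-but-valid fallback instead of erroring the trial:

```python
import asyncio
import functools

def with_bounded_retry(fn, should_retry, attempts=2, fallback=None):
    """Wrap an async SDK method with a small bounded retry (a sketch)."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        last_exc = None
        for _ in range(attempts):
            try:
                return await fn(*args, **kwargs)
            except Exception as exc:
                if not should_retry(exc):
                    raise  # unrelated failures propagate untouched
                last_exc = exc
        if fallback is not None:
            return fallback()  # empty-but-valid response for callers
        raise last_exc
    return wrapper

async def _always_malformed():
    raise ValueError("1 validation error for SessionCommandLogsResponse")

patched = with_bounded_retry(
    _always_malformed,
    should_retry=lambda exc: "SessionCommandLogsResponse" in str(exc),
    fallback=lambda: "EMPTY_LOGS",
)
result = asyncio.run(patched())
print(result)  # → EMPTY_LOGS
```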

* Daytona retry: catch DaytonaError wrapping the malformed-logs ValidationError

The first version of this patch only matched on pydantic ValidationError,
but AsyncProcess.get_session_command_logs is decorated by intercept_errors
at class-definition time — every inner exception is converted to
DaytonaError before our patched bound method ever sees it. Verified
against latex-formula-extraction on Daytona: the patch wrapper was being
called, but the except-clause never matched, so the run still failed.

Match on DaytonaError whose message contains 'SessionCommandLogsResponse'
in addition to bare ValidationError, and drop the wrapper to 2 attempts
(harbor already wraps the call in tenacity x3 — extra retries here are
wasted on a deterministic malformed payload). Empty-fallback unchanged.
* fix: env-file path mismatch in DinD compose mode

Devin caught a real bug introduced by PR #193 (DinD compose ACP):
src/benchflow/process.py:325 sets remote_env_path = "/tmp/benchflow_env_$$.env"
expecting the remote shell to expand $$ to its PID. But shlex.join() at
line 329 single-quotes the --env-file argument, so docker compose receives
the literal string "/tmp/benchflow_env_$$.env" while the cat heredoc that
writes the file (line 339, raw f-string) does expand $$. The file is
written to /tmp/benchflow_env_<pid>.env and read from /tmp/benchflow_env_$$.env
— silent mismatch, env vars (incl. API keys) silently dropped in DinD
compose tasks.

Fix: use uuid.uuid4().hex[:16] for the unique suffix instead of relying on
shell-side $$ expansion. The path is then a literal that survives quoting.
Apply the same fix to the direct (non-DinD) Daytona branch even though it
was working — uniformity makes the path robust against future quoting
changes.

Also fix a pre-existing SIM103 lint error in _daytona_patches.py that
ruff caught while validating the test changes.

Tests: tests/test_process.py +2 regression tests pinning that no remote
command contains a literal "$$" — would catch this exact regression.
8/8 process tests pass; ruff clean.
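The quoting mismatch reproduces with nothing but the standard library — `shlex.join` single-quotes any argument containing `$`, so the remote shell never expands `$$` inside the joined command line:

```python
import shlex
import uuid

# The `$$` in the argument survives as a literal once shlex.join quotes it:
cmd = shlex.join(["docker", "compose", "--env-file", "/tmp/benchflow_env_$$.env"])
print(cmd)  # → docker compose --env-file '/tmp/benchflow_env_$$.env'

# A literal unique suffix is immune to any quoting decision:
remote_env_path = f"/tmp/benchflow_env_{uuid.uuid4().hex[:16]}.env"
safe_cmd = shlex.join(["docker", "compose", "--env-file", remote_env_path])
assert "$$" not in safe_cmd
```

Meanwhile an unquoted heredoc on the writing side does expand `$$` — hence the two sides disagreeing on the path.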

* test: reference PR #193 / #198 in regression test docstring

Devin caught: CLAUDE.md mandates regression tests name the commit/PR
they guard. Updated TestDaytonaProcessEnvFilePath docstring to cite
PR #198 (the fix) and PR #193 / commit cdccac7 (the regression).
# Conflicts:
#	src/benchflow/_agent_env.py
#	src/benchflow/cli/eval.py
#	tests/test_oracle_chokepoint.py

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 8 additional findings.


@xdotli xdotli merged commit 23b4de4 into main Apr 25, 2026
2 of 3 checks passed
@xdotli xdotli deleted the dev-0.3 branch April 25, 2026 11:04
